Computational aspects of nuclear coupled-cluster theory 
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OQ We discuss computational aspects of the spherical coupled-cluster method specific to the 

» , nuclear many-body problem. Using chiral nucleon-nucleon interaction at next-to-next-to- 

next-to leading order (N 3 LO) with cutoff A = 500MeV, we present coupled-cluster results 

for the ground state of 40 Ca. Scaling and performance studies are presented together with 

challenges we meet with when extending the coupled-cluster effort to nuclei mass hundred 

\n and beyond. 

§1. Introduction 

The low-energy nuclear many-body problem is a challenging undertaking, how- 
ever, in the last decade there has been significant progress in first principle cal- 
culations of nuclei§>§>§M>§'0'§'§ With the advance of chiral effective field 
i "i theory^ IIIrllHrll^'lidrll^l 1 which allows for a systematic and consistent derivation of 

the nuclear forces rooted in low-energy Quantum-Chromo-Dynamics (QCD), and 
with the development of advanced many-body methods and high performance com- 
puting facilities, first principle computations of medium mass and neutron rich nuclei 
at the extremes of the nuclear chart are now within reach. The computational cost 
involved in these calculations grows rapidly as one moves beyond the lightest to- 
wards the medium mass region and beyond, and it has been crucial to implement 
techniques that scale gently with system size and code development that leverages 
the benefits of modern architecture at high-performance computing facilities. 

The coupled-cluster (CC) method is an optimal approach to medium mass and 
neutron rich nuclei as it is an ideal compromise between accuracy on the one hand 
and computational cost on the other. Coupled-cluster theory was introduced in 
nuclear physics in the late 1950's by Coester and KummeliSrllZf and was shortly 
thereafter introduced in quantum chemistry by Cizekli5rli2l Coupled-cluster theory 
has now been established as the "gold standard" for first principle computations 
in quantum chemistry, see RefHj 1 for a recent review. Only in the last decade 
has coupled-cluster reemerged in the nuclear physics community and has established 
itself as a state-of-the art approach to structure of medium mass and neutron rich 
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The nuclear coupled-cluster code suite developed at Oak Ridge National Lab- 
oratory, called NUCCOR (Nuclear Coupled-Cluster - Oak Ridge) has been further 
advanced under UNEDF (Universal Nuclear Energy Density Functional) a five-year 
SciDAC (Scientific Discovery through Advanced Computing) project H2M22P Both 
m-scheme (NUCCOR) and j-coupled (spherical) (NUCCOR-CCSD(T)) schemes for 
implementing CC at various levels of clustering approximation (i.e. single, doubles, 
triples, etc.) has been developed. Under UNEDF, considerable effort has been put 
into optimizing the NUCCOR-CCSD(T) VI. doubles kernel for the Jaguar Cray 
XT5 system, Oak Ridge Leadership Computing Facility's (OLCF) flagship leader- 
ship computing resource IHI 

To continue to achieve breakthrough discoveries in low-energy nuclear physics 
using CC, we need to push investigations to larger medium-mass nuclei in larger 
model spaces. This requires continued emphasis on scaling the CC algorithms and 
performance optimization. Additionally, Jaguar is presently undergoing a multi- 
phase upgrade to become Titan, a Cray XK6 with a single socket of AMD's 16-core 
interlagos chip with 32 GB of shared memory and another socket with the NVidia 
GPUs on a single node. Therefore we need to further develop and improve our 
implementation of coupled-cluster theory in preparation for the architectural change. 

This paper is organized as follows. In section 2 we provide an overview of the 
spherical coupled-cluster method used in NUCCOR-CCSD(T) and present conver- 
gence properties for the ground state of 40 Ca. Section 3 describes the computational 
considerations of the CC code suite and presents performance results on Jaguar XK6 
without GPUs, and describes future code developments necessary for continued re- 
search. Section 4 provides conclusions and outlook. 

§2. Nuclear coupled-cluster theory 

In coupled-cluster theory we write the j4-nucleon ground state wave function <Pb 
in the following form 

|^o) = e _t |^o), f = f 1 + f 2 + ... + f A . (2-1) 

Here |0o) is the uncorrelated (mean-field) reference state, and T is a linear expansion 

in particle-hole excitations operators. The ^-particle fc-hole (kp-kh) cluster operator 

is 

i 

fk = jm Yl t T i ::Z^---a ] ak a lk ...a il . (2-2) 

Here and in the following, i,j,k,... label occupied single-particle orbitals while 
a, b, c, . . . label unoccupied orbitals. From the exponential wave-function ansatz in 



Eq. (2-1) it follows directly that coupled-cluster theory is based on the similarity 
transform 

H^ = e- f H N e f (2-3) 

of the normal-ordered Hamiltonian H^. Here, the Hamiltonian is normal-ordered 
with respect to a reference state \cfto). The most commonly used approximation is 
coupled-cluster with singles-and-doubles excitations (CCSD) where T ss T\ + T%. 
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The unknown amplitudes tf and £"? in Eq. (|2-2|) are determined from the solution of 



tab 

the coupled-cluster equations 

= <#|fl]0o) , (24) 

= <<^ Wo) • (2-5) 

Here \c/)f) = a a &i\<j)o) is a lp-lh excitation of the reference state, and ]</$) is a 



similarly defined 2p-2h excited state. The CCSD equations (2-4) thus demand that 



the reference state \<po) has no lp-lh and no 2p-2h excitations, i.e. it is an eigenstate 



of the similarity transformed Hamiltonian (2-3) in the space of all lp-lh and 2p-2h 
excited states. Once the CCSD equations are solved, the ground-state energy is 
computed as 

E = {<fo\H\<t> Q ) . (2-6) 



Using Wick's theorem the coupled-cluster equations in (2-4) can be written as a set 



of coupled non-linear set of equations in the t\ and tfj amplitudes. However, this 
task is very tedious and error prone, and a much more direct and intuitive derivation 
procedes via a diagrammatic derivation using a set of well defined diagram rules (see 
e.g. Ref™). In the CCSD approximation the number of non-linear equations are 
given by n n u + n^n^. Here n is the number of occupied orbitals in the reference 
state and n u is the number of unoccupied states above the fermi level. In a typical 
case calculating the CCSD ground state of 40 Ca in 15 major oscillator shells we 
have a total of 40 occupied and 2680 unoccupied orbtials, resulting in a total of 
~ 10 10 non-linear coupled equations. This problem is too large to be solved by 
direct inversion techniques, and we use an iterative solution scheme together with 
krylov subspace methods such as BroyderuHlt^f or Direct-Inversion in the Iterative 
Sub-Space (DIIS) 34 ' as convergence accelators. 

The most expensive contribution to the T<i amplitude in the CCSD approxima- 
tion is __ 

cd 

it is clear that the computational cost associated with this contribution is r? n\. 
Coupled-cluster theory therefore scales polynomial with the system size, which is a 
rather soft scaling compared to the combinatorial scaling of methods such as the 



full configuration interaction (FCI) method. Equation (2-7) can be rewritten in a 
matrix-matrix multiplication form, so that we can utilize the optimized basic linear 
algebra library (BLAS)I^I 1 Although the scaling of the CC method is rather soft, it 
is clear that due to memory limitations we quickly reach the limit of the maximum 
model space that can be handled on modern computers. For example, in 15 major 
oscillator shells the two-body interaction alone would require about 600 TByte of 
memory alone and the T<i amplitudes would require 100 GByte of memory. 

In order to overcome this bottleneck, we recently derived and implemented the 
coupled-cluster equations in a coupled- angular momentum scheme pirLJ thereby uti- 
lizing the spherical symmetry of closed shell nuclei. For such nuclei, the cluster 



operator in Eq. (2-2) is a scalar under rotation, and depends only on reduced ampli- 



tudes. The reduced matrix elements of the Hamiltonian are stored in six different 
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blocks according to 



[pppp] , [hppp] , [hhpp] , [hphp] , [hhhp] , [hhhh] , 

where p denotes particle states and h denotes hole states. These matrices are sparse 
and block diagonal, each block is defined by a set of quantum numbers J W ,T Z . The 
largest matrices are the [pppp] and [hppp] are distributed among the processors by 
adding row by row to given processor until a criterion for optimal load balancing and 
memory distribution is reached, as illustrated in Fig. [T] By this distribution scheme 




Fig. 1. (Color online) Block diagonal structure of the interaction and parallel distribution scheme 



each processor will have a set of square and rectangular matrices for a subset of chan- 
nels (quantum numbers) J V ,T Z . By this scheme, we can utilize the optimized linear 
algebra libraries Bias and Lapadsu^P for matrix-matrix and matrix- vector operations. 
It is clear that the similarity transformed Hamiltonian is also a scalar under rotation 
and can be distributed in the same scheme, and it is straight forward (but somewhat 
tedious) to work out the CCSD equations within this formulation. 

As an example of the computational savings we achieve by switching from an 
uncoupled to a j-coupled scheme, we consider the case of 40 Ca in 15 major shells. In 
m— scheme we have a total of 2720 orbitals while in the coupled j-scheme we have 240 
orbitals, the number of non-linear CCSD equations in m-scheme is ~ 10 10 while in 
j— scheme we have ~ 10 6 . The various interaction blocks in Fig. |2| appear in various 
topologically different diagrams in the CCSD equations. Therefore it is very difficult 
to make the CCSD code scale optimally with increasing number of processors. We 
choose to distribute the blocks according to the number of computational cycles of 
the most expensive diagrams they appear in. Therefore scaling can only be optimal 
for these subsets of diagrams, while the remaining ones will scale less optimally. 

To illustrate the convergence properties of the coupled-cluster ground state as a 
function of model space size we compute the ground state of 40 Ca within /l-CCSD(T) 
approacho We start from the intrinsic Hamiltonian where the nucleon-nucleon 
interaction is given by the N 3 LO interaction (A x = 500 MeVc -1 ) by Entem and 
MachleidtMI'll^f The nucleon-nucleon interaction is given in relative coordinates, 
and in order to express it in a harmonic oscillator single-particle basis we need to 
transform it to the laboratory frame using the Brody-Moshinsky transformation!^ 
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This transformation depends on the number of partial waves we include in the relative 
coordinate frame given, and is given by J re i-max- In order to have fully converged 
calculations we need to make sure we have included sufficient number of partial waves 
in the Brody-Moshinsky transformation. The convergence is therefore two-fold, (i) 
as a function of the single-particle basis size, and (ii) as a function of number of 
relative partial waves included in the relative-to-laboratory frame transformation. 
We perform our calculations using a Hartree-Fock basis expressed in a harmonic 
oscillator single-particle basis with a frequency of hoj = 26 MeV. In Fig. [2] we show 
the convergence for the ground state of 40 Ca as a function of number of harmonic 
oscillator shells N = 2n-\-l (left figure), and as a function of Jrei-max for the Brody- 
Moshinsky transformation, keeping the single-particle space fixed at N = 16 shells 
(right figure). We see that N = 16 is sufficient size for the single-particle model space 
in order to converge the ground state to within 1 MeV, while we need to include 
relative partial waves up to J re i- m ax = 8, 10 to reach convergence for the Brody- 
Moshinsky transformation. In our earlier calculations^ 'I2P we used J rc i-max = 4, 
and therefore underestimated the binding energy of 40 Ca by ~ 15 MeV. 
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Fig. 2. Convergence of /l-CCSD(T) ground state energy of 40 Ca as a function N = 2n + 1 (left 
figure) and J ro i_ max (right figure) 



§3. Computational Considerations 



In coupled-cluster calculations, the computational challenge consists of scaling 
the computational resources with the increasing size of the model space and size 
of nuclei. For an interaction with momentum cutoff A and a nucleus with radius 
R, the size of the single-particle basis scales as (RA) 3 ~ ^4yl 3 P with A being the 
mass number. In 20 oscillator shells, the matrix elements of the two body-interaction 
require about 100 GByte of memory, and for tin isotopes, we expect that 25-30 shells 
will be required for present-day chiral interactions. 

The NUCCOR-CCSD(T) application has been developed for the Jaguar super- 
computer located at Oak Ridge National Laboratory. It utilizes MPI and threaded 
BLAS and LAPACK libraries, suited for the Cray XT5 architecture comprised of 
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dual hex-core processors with 16 GB of memory on the node. Jaguar is presently 
undergoing a phased upgrade from an XT5 to an eventual Cray XK6, called Titan, 
with a single interlagos 16-core chip with 32 GB of memory paired with NVidia 
GPUs. The current phase of the upgrade has fully deployed the CPU enhancements 
with the GPU additions expected to complete at the end of 2012. 

Hybrid programming, using both MPI and OpenMP, is more advantageous in 
the current configuration of the system due to the large shared memory and the 
large number of cores on a single node. In an effort to efficiently utilize this new 
architecture, we have made added further parallelism in the triples calculation us- 
ing OpenMP. This section presents performance results of the improved NUCCOR- 
CCSD(T) V2.0 application on the current Jaguar XK6 configuration, without GPUs. 

3.1. Scaling by model space 

Figure K^ shows strong scaling results for NUCCOR-CCSD(T) V2.0 using the 
PGI 11.10.0 compiler with optimization -fast and the Cray LibSci 11.0.4 scientific 
threaded library for 40 Ca in model spaces N = 12, 14, 16 and 18. Runtimes for the 
triples calculation are presented with total runtimes along with labels showing the 
percentage of runtime used in the triples calculation at various numbers of cores. 
These calculations utilize 4 MPI processes each spawning 4 threads using OpenMP 
and the threaded libraries to optimally use all cores on a node. 
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Fig. 3. (Color online) Strong scaling results for 40 Ca in model spaces N — 12, N — 14, N = 16, 
and N = 18. Dashed lines show triples runtime only and solid lines represent total runtime. 
Percentage labels show percent of total runtime spent in triples calculation. 

Seen in all model spaces, the runtime of NUCCOR-CCSD(T) V2.0 is dominated 
by the triples calculation at small numbers of processors, consuming roughly 90% 
of the total runtime. Utilizing more processors shows a dramatic decrease in the 
runtime of the triples calculation, which in turn decreases the overall runtime. At 
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larger numbers of processors, the triples calculation consumes less than 50% of the 
total runtime, indicating a necessity to revisit performance optimizations for other 
regions of the code, including the single and doubles calculation, setup, and I/O, 
which consume the remainder of the runtime. 
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Fig. 4. (Color online) Speedup from 8 nodes (128 cores) for 40 Ca in model spaces iV = 12 (red 
square), N = 14 (purple diamond), N — 16 (orange circle), and N = 18 (blue triangle). Dashed 
lines show triples runtime only and solid lines represent total runtime. 



The highest number of cores used in each model space shows the limits of strong 
scaling for both the triples and the total runtime. This is further evident in the 
speedup trends shown in Fig.[4j where the speedup at each model space drops sharply 
at the highest processor count for each model space. We use 128 cores for T(l) in 
the speedup calculations, which is the smallest configuration to run all model spaces. 
Figure [4] shows drastic speedup in the triples calculations with larger number of 
cores, but that the total runtime only improves by roughly half those values. With 
improvements in the triples calculation, which originally dominated the NUCCOR- 
CCSD(T) runtime, other regions of code utilizing MPI only will also require further 
parallelization to utilize the new architecture. 

3.1.1. MPI vs. Threading 

Although the use of threads through OpenMP has greatly improved the triples 
calculation and thus the overall runtime, Fig. [5] shows that increasing the number 
of threads does not improve the total runtime. In Fig. [5] we present scaling results 
using three configurations, varying the number of MPI processes and threads per 
process. We show that using 4 MPI processes and 4 threads performs equally well 
for the triples calculation as using 2 MPI processes and 8 threads. Due to the 
MPI only regions of code, yet to be optimized for the new architecture, the total 
runtime is worse using less MPI processes and more threads. Also shown in Fig. [5] 
is a degradation in performance when using more MPI processes and less threads (8 
MPI process and 2 threads) since it does not allow the benefits of threading to be 
realized. 
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Fig. 5. (Color online) Comparison of runtimes (in seconds) of NUCCOR-CCSD(T) V2.0 for 40 Ca, 
N = 16 with varying MPI processes and threads. Dashed lines show triples runtime only and 
solid lines represent total runtime. Thread configurations include: 8 MPI processes with 2 
threads each (blue dot), 4 MPI processes with 4 threads each (red square), and 2 MPI processes 
with 8 threads each (yellow triangle) on a node. All compute cores on the node are utilized. 



§4. Conclusions and future perspectives 

We have made substantial performance improvements in NUCCOR-CCSD(T) 
V2.0 by implementing threading in the triples calculation using OpenMP. For exam- 
ple, our triples calculation consumed 70 — 95% of the runtime, scaling to 8192 cores 
for iV=18 in 40 Ca. This example used 4 MPI processes each spawning 4 threads, 
resulting in a total number of 32768 requested MPI processes. Note that we typically 
need several runs of this size for each nucleus, in order to determine convergence and 
to map out the dependence on the model space parameters. 

These improvements now reveal new areas for continued development to optimize 
the total runtime performance to reach nuclei larger than 40 Ca in larger model spaces. 
Improvements include further threading MPI only regions, including the singles and 
doubles calculations, setup, and improvements to I/O. 

In preparation for utilizing the full capability of Titan, the hybrid CPU-GPU 
system, which will be available in Q4 of 2012, we are also going through the exercise 
of identifying compute-intensive kernels suitable to be off-loaded to the GPU and 
implementing GPU-optimized libraries, GPU directives, and GPU languages such as 
CUDA and OpenCL, as needed to continue to improve the models and implement 
more realistic system constraints. For example, the inclusion of continuum effects in 
weakly bound nuclei and nuclear reactions will require us to utilize a large number of 
scattering states which further increases the size of the model space. Increasing the 
order of the coupled-cluster approximation and including higher-body effects, such 
as three-body and four-body forces, present additional computational challenges to 
both developers and computing systems. 
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